{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Pandas常用数据类型及其操作\n",
    "## 一.Pandas读取和存储文件\n",
    "## 二.数据排序\n",
    "## 三.描述性统计分析\n",
    "## 四.层次化索引（补充）\n",
    "## 知识目标：掌握描述性统计分析的常用方法\n",
    "## 技能目标：掌握pandas读写不同类型文件的方法\n",
    "## 重点：描述性统计分析\n",
    "## 难点：层次化索引\n",
    "========================================================="
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 一.Pandas读取和存储文件\n",
    "### 1.读写Excel文件\n",
    "#### （1）读取Excel文件\n",
    "to_excel()函数可以将Excel文件中的数据读取出来，并转成DataFrame对象。\n",
    "\n",
    "to_excel()函数语法：pd.read_excel(io,sheet_name=0,header=0,index_col=None ……)\n",
    "#io：接收字符串，表示路劲对象。\n",
    "#sheet_name：指定要读取的工作表，可接收字符串（工作表名）或int型（工作表索引）。\n",
    "#header：用于表示DataFrame的列标签，即指定某行数据作为列名。如果传入一个整数列表，则这些行会合并为一个MultiIndex对象(层次化索引)。\n",
    "#index_col：用作行索引的列编号或者列名，如果给定一个序列，则表示有多个行索引。\n",
    "#usecols：用来指定要读取的列的索引或名字。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>姓名</th>\n",
       "      <th>交易额</th>\n",
       "      <th>柜台</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>工号</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1001</th>\n",
       "      <td>张三</td>\n",
       "      <td>1664.0</td>\n",
       "      <td>化妆品</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1002</th>\n",
       "      <td>李四</td>\n",
       "      <td>954.0</td>\n",
       "      <td>化妆品</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1003</th>\n",
       "      <td>王五</td>\n",
       "      <td>1407.0</td>\n",
       "      <td>食品</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1004</th>\n",
       "      <td>赵六</td>\n",
       "      <td>1320.0</td>\n",
       "      <td>食品</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1005</th>\n",
       "      <td>周七</td>\n",
       "      <td>994.0</td>\n",
       "      <td>日用品</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1002</th>\n",
       "      <td>李四</td>\n",
       "      <td>859.0</td>\n",
       "      <td>蔬菜水果</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1004</th>\n",
       "      <td>赵六</td>\n",
       "      <td>1668.0</td>\n",
       "      <td>日用品</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1004</th>\n",
       "      <td>赵六</td>\n",
       "      <td>1722.0</td>\n",
       "      <td>日用品</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1003</th>\n",
       "      <td>王五</td>\n",
       "      <td>1274.0</td>\n",
       "      <td>食品</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1006</th>\n",
       "      <td>钱八</td>\n",
       "      <td>812.0</td>\n",
       "      <td>食品</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>249 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      姓名     交易额    柜台\n",
       "工号                    \n",
       "1001  张三  1664.0   化妆品\n",
       "1002  李四   954.0   化妆品\n",
       "1003  王五  1407.0    食品\n",
       "1004  赵六  1320.0    食品\n",
       "1005  周七   994.0   日用品\n",
       "...   ..     ...   ...\n",
       "1002  李四   859.0  蔬菜水果\n",
       "1004  赵六  1668.0   日用品\n",
       "1004  赵六  1722.0   日用品\n",
       "1003  王五  1274.0    食品\n",
       "1006  钱八   812.0    食品\n",
       "\n",
       "[249 rows x 3 columns]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#例1：读取‘超市营业额2.xlsx’文件中的“工号,姓名,交易额,柜台”四列数据，并将“工号”指定为行索引。\n",
    "import pandas as pd\n",
    "df = pd.read_excel(r'D:\\Desktop\\超市营业额2.xlsx',index_col=0,usecols=['工号','姓名','交易额','柜台'])\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "思考1：读取九九乘法表数据"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### （2）写入Excel文件\n",
    "将文件存储为Excel文件，可以使用to_excel()方法。\n",
    "to_excel()语法：DataFrame.to_excel(excel_writer=None,sheetname='None',na_rep='',header=True,index=True,index_label=None,mode='w',encoding=None)\n",
    "#excel_writer:表示文件存放的路径\n",
    "#sheet_name:表示工作表的名称，可以接收字符串，默认为“Sheet1”\n",
    "#na_rep:表示缺失值\n",
    "#header：是否将列名写出\n",
    "#index:表示是否将行索引写出，默认为True\n",
    "#index_label:表示索引名\n",
    "#mode:表示数据写入模式\n",
    "#encoding：表示存储文件的编码格式"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#例2：将上周思考题3写入Excel文件，命名为“新生报到统计”\n",
    "data_demo = np.random.randint(30,100,(10,3))\n",
    "date_list = pd.date_range(start='202010100800',end='202010101700',freq='H')\n",
    "d6 = pd.DataFrame(data_demo,index=date_list,columns=['数据信息学院','艺术设计学院','医药护理学院'])\n",
    "d6.to_excel('D:\\pyData\\新生报到统计.xlsx',sheet_name=\"新生报到统计\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.读写csv文件\n",
    "#### （1）读取csv文件\n",
    "pandas提供read_csv()函数来读取csv文件,若要读取text文件，则使用read_table()函数来读取。\n",
    "\n",
    "read_csv()函数语法：\n",
    "\n",
    "pd.read_csv(filepath,sep=',',hearder='infer',names=None,index_col=None,dtype=None,encoding=utf-8,engine=None,nrows=None)\n",
    "#filepath:表示文件路径\n",
    "#sep:表示分隔符\n",
    "#header:表示将某行数据作为列名，默认为‘infer’，表示自动识别\n",
    "#index_col:表示索引列的位置\n",
    "#engine:表示数据解析引擎，接收c或Python，默认为c\n",
    "#nrows:表示读取前n行"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#例3：读取meal_order_info.csv文件中的前10行数据\n",
    "order_data = pd.read_csv(r'D:\\pyData\\meal_order_info.csv',encoding='gbk',nrows=10)\n",
    "order_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### （2）写入csv文件\n",
    "将文件存储为csv文件，可以使用to_csv()方法。\n",
    "to_csv()语法：DataFrame.to_csv(path_or_buf=None,sep=',',na_rep='',columns=None,header=True,index=True,index_label=None,mode='w',encoding=None)\n",
    "#path_or_buf:表示文件存放的路径\n",
    "#sep:表示分隔符\n",
    "#columns:表示是否将列名写出\n",
    "#na_rep:表示缺失值\n",
    "#header：是否将列名写出，接收布尔值\n",
    "#index:表示是否将行索引写出，默认为True\n",
    "#index_label:表示索引名\n",
    "#mode:表示数据写入模式\n",
    "#encoding：表示存储文件的编码格式"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.读写数据库文件\n",
    "#### （1）连接准备\n",
    "通过SQLAlchemy建立与数据库的连接，下载安装对应的连接工具\n",
    "\n",
    "下载PyMySql：\n",
    "pip install pymysql -i http://pypi.douban.com/simple/\n",
    "#### （2）创建连接\n",
    "使用create_engine() 函数创建连接\n",
    "\n",
    "语法格式：create_engine(‘数据库类型+数据库连接器://用户名:密码@机器地址:端口号/数据库名?字符类型’)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#例4：创建连接\n",
    "import pymysql\n",
    "from sqlalchemy import create_engine\n",
    "engine = create_engine('mysql+pymysql://root:123456@127.0.0.1:3306/dsj?charset=utf8')\n",
    "engine"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### (3)读取数据库文件\n",
    "#pandas读取数据库文件有3个函数：\n",
    "\n",
    "①read_sql_table（只能读某个表）：pd.read_sql_table (table_name,con)  #con=engine接受连接 无默认\n",
    "\n",
    "②read_sql_query（只能查询）：pd.read_sql_query (sql, con)\n",
    "\n",
    "③read_sql（既能读表也能查询）                   "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#例5：使用read_sql_table读取“学生信息表”数据库文件\n",
    "sql_data = pd.read_sql_table('学生信息表',con=engine)\n",
    "sql_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#例6：使用read_sql_query读取“学生信息表”中班级为“大数据1902”的数据\n",
    "sql='select * from 学生信息表 where 班级=\"大数据1902\";'\n",
    "sql_stu2 = pd.read_sql_query(sql,con=engine)\n",
    "sql_stu2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### （4）写入数据库\n",
    "\n",
    "使用to_sql()方法将数据写入到数据库中\n",
    "\n",
    "to_sql()语法：DataFrame.to_sql (name,con,if_exists='fail',index=True,index_label=None)\n",
    "#if_exists：可以取值为fail,replace,append，默认为fail。\n",
    "\n",
    "fail：如果表存在，则不执行写入操作\n",
    "\n",
    "replace：如果表存在，则将源数据库表删除再重新创建\n",
    "\n",
    "append：如果表存在，则在原数据库表的基础上追加数据\n",
    "#index:表示将DataFrame行索引作为数据传入数据库"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#例7：将上周思考题3写入数据库，命名为“新生报到统计”\n",
    "data_demo = np.random.randint(30,100,(10,3))\n",
    "date_list = pd.date_range(start='202010100800',end='202010101700',freq='H')\n",
    "d6 = pd.DataFrame(data_demo,index=date_list,columns=['数据信息学院','艺术设计学院','医药护理学院'])\n",
    "d6.to_sql('新生报到统计',con=engine)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 二.数据排序\n",
    "pandas对象是索引和数据的结合，所以它既可以按索引排序，也可以按数据排序。\n",
    "### 1.按索引排序\n",
    "使用sort_index()方法可以按行索引或者列索引进行排序\n",
    "\n",
    "sort_index()语法：DataFrame.sort_index(axis=0,level=None,ascending=True,inplace=False)\n",
    "#level：若不为None，则对指定索引级别的值进行排序\n",
    "#ascending：是都为升序排序，默认为True\n",
    "#inplace：表示是否将排序的结果创建为新的实例，默认为False"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#例7：将d6中的数据按照行索引进行降序排列\n",
    "d6.sort_index(axis=0,ascending=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.按数据排序\n",
    "使用sort_values()方法可以按值进行排序\n",
    "\n",
    "sort_values()语法：DataFrame.sort_values(by,axis=0,ascending=True,inplace=False,na_position='last')\n",
    "#by:表示参与排序的列\n",
    "#na_position:此参数有两个值，first和last，若为first，则将NaN值放在开头，否则将放在最后。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#例8：将d6中数据信息学院的数据进行升序排序\n",
    "d6.sort_values(by=['数据信息学院'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 三.描述性统计分析\n",
    "描述性统计是用来概括、表述事物整体状况，以及事物间关联、类属关系的统计方法。通过几个统计值可以简单地表示一组数据的集中趋势和离散程度。\n",
    "pandas描述性统计方法：\n",
    "#min\t最小值\n",
    "#mean\t均值\n",
    "#median\t中位数\n",
    "#var\t方差\n",
    "#sem\t标准误差\n",
    "#skew\t样本偏度\n",
    "#quantile\t四分位数\n",
    "#max\t最大值\n",
    "#ptp\t极差\n",
    "#std\t标准差\n",
    "#cov\t协方差\n",
    "#mode\t众数\n",
    "#kurt\t样本峰度\n",
    "#count\t非空值数目\n",
    "#mad\t平均绝对离差\n",
    "#describe\t描述统计：能够一次性得出数值型特征的非空值数目，均值，四分位数和标准差\n",
    "### 1.数值型特征的描述性统计"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "count      246.000000\n",
       "mean      1330.313008\n",
       "std        904.300720\n",
       "min         53.000000\n",
       "25%       1031.250000\n",
       "50%       1259.000000\n",
       "75%       1523.000000\n",
       "max      12100.000000\n",
       "Name: 交易额, dtype: float64"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#例9：查看“超市营业额”中的交易额数据，并使用describe对交易额数据进行描述性统计\n",
    "df['交易额'].describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.类别型特征的描述性统计\n",
    "描述类别型特征的分布状况，可以使用value_counts()方法进行频数统计"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#例10：查看“超市营业额”中的交易额数据，并使用value_counts()方法对柜台数据进行频数统计\n",
    "df['柜台'].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.category类型的描述性统计（补充）\n",
    "category类型：分类数据，可以理解成取值有限的，或者是固定数量的可能值。例如，性别，血型。\n",
    "\n",
    "describe方法除了支持传统数值型数据以外，还能够对category类型的数据进行描述性统计，统计的4个统计量分别是：\n",
    "#count（非空元素的数目）\n",
    "#unique（类别的数目）\n",
    "#top（数目最多的类别）\n",
    "#freq（数目最多类别的数目）"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "count      249\n",
       "unique       4\n",
       "top       蔬菜水果\n",
       "freq        63\n",
       "Name: 柜台, dtype: object"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#例11：用describe方法查看“超市营业额”中的柜台数据特征描述\n",
    "df['柜台']=df['柜台'].astype('category')  #使用astype将目标特征的数据类型强制转换为category类型\n",
    "df['柜台']\n",
    "df['柜台'].describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 四.层次化索引（补充）\n",
    "详见word文档"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
